A Computer Network Controller
Background of the Invention

The present invention relates generally to the field of computer networks,
and in particular to a general computer network controller and a method for local
and remote asynchronous completion control in a system area network.

Most existing high-performance network controllers have to be managed by
the operating system or by kernel agents in order to guarantee protected
accesses across different nodes. Users have to do system calls to remote
memory through high latency programming interfaces. In addition, explicit
synchronization and completion control decreases the sustained bandwidth
between users. These problems are described in "An Implementation and

Communication between network controllers in a system area network
(SAN) is handled by switching fabrics and point-to-point links. Among the
situations which create network congestion are: i) a failed network component,
ii) a high-performance node sending packets into a low-performance node, and
iii) several nodes sending data packets to one particular node (thus creating a
hot node). If such a congestion problem is not handled properly, network
throughput will be reduced.

US patent no. 5,613,071 (Rankin et al.) discloses a method and an
apparatus for providing remote memory access in a distributed memory multi- 
processor system. 

Further, US patent no. 5,915,088 (Basavaiah et al.) discloses a multi-processor
system that is configured so that each CPU of the system has access to
at least portions of the memory of any other CPU. 

These two patents describe a more general way of doing RMA with 
address mapping which has been available for some time. They do not refer to 
any implementation issues or method of optimization. 

Particularly in a network, it is important to find a method of scheduling
packets based on e.g. priorities, control messages, data messages, and to avoid
congestion. When managing virtual channels (virtual channels are described
more fully in co-pending US patent application no. ... , "Virtual channel flow
control...", assigned to the assignee of the present application, the relevant
disclosures of which co-pending application are incorporated herein by reference),
a solution must be found regarding the problem of providing a method of reacting
to the flow control information provided by the SAN layer.

Generally, a network controller forwards packets to the attached bus as
they arrive from the network. That is, the bus to which the network controller is
attached may not necessarily be utilized to its optimum. This leads to a possible
problem of decreased bandwidth.



Summary of the Invention 

The computer network controller of the present invention solves or at least
alleviates the problems of the prior art as stated above. The network controller of
the present invention solves or alleviates the congestion problem by its inherent
ability to do implicit fabric rate injection control. The network controller of the
present invention also solves or alleviates the problem of reacting to flow control
information from a link layer, by having the ability to schedule packets onto
different virtual channels depending on congestion information received from the
switching fabric. Furthermore, in order to utilize an attached bus to its optimum,
the network controller of the present invention decouples the data packet size in
the network from the packet size of the bus to which the network controller is
attached. Furthermore, the network controller of the invention processes tasks in
parallel in order to meet the required bandwidth from both front-end and back-end
buses.

Thus, in the most general embodiment of a first aspect of the present
invention, there is provided a general computer network controller, preferably
operative in a System Area Network, which network controller includes a data
buffer handling payload as well as a dedicated, programmable micro sequencer
handling control flow and being capable of running different network packets and
protocols, being packet format independent and network independent. The
programmable micro sequencer is tightly coupled to a fully associative context
block for control thereof, and the context block is operative to hold a number of
most recently used contexts to provide a dynamic resource allocation scheme
reflecting run time situations. Substantial parts of the contexts are updated by the
micro sequencer, by an inbound scheduler and by a network protocol engine.

Preferably, the micro sequencer is operative to control a scalable memory 
array which can be used as a table for inbound address mapping of registered 
memory and access protection, and as a means for keeping context information
about all active channels. 

In a preferred embodiment of the invention, the fully associative context
block constitutes a connection between the inbound scheduler and the network
protocol engine, thereby giving the network controller the ability to pipeline tasks
and execute in parallel.

In the same preferred embodiment, the context block may also be operative
to have contexts dynamically allocated between inbound Remote Direct Memory
Access (RDMA), inbound Remote Memory Access (RMA) and outbound RMA,
two upper contexts nevertheless being reserved for locally driven remote direct
memory access, while the context block contains information including the
following:

expected sequence number of next packet for checking,
input gathering size in order to optimize use of an attached bus,
packet type defined by the network for a specific virtual channel,
accumulated packet cyclic redundancy check for data integrity,
source addresses,
destination addresses,
mapping for RDMA operations,
dedicated flags like page crossing to do new mapping,
word count zero detection,
as well as protection tag check,

all of these information events from the inbound scheduler, the micro sequencer
and the network protocol engine to be synchronized by the context block and
used by the micro sequencer to invoke, restart, switch or terminate a thread
immediately.

In another embodiment of the invention, the micro sequencer is operative to
control the network protocol engine which in its turn is operative to perform
injection control, based on feedback from a link layer as well as intervention from
an operating system. The network protocol engine is then operative to schedule
packets to the network.

In this embodiment, the network protocol engine may further be informed 
about onto which virtual or physical lane packets are going to be sent, and it may 
also utilize the capability of the data buffer and transmit up to four packets from 
different tasks simultaneously, namely a request and a response to the network 
and a request and a response to an attached bus.

In a further embodiment of the first aspect of the invention, the inbound
scheduler is operative to decode, schedule and invoke running tasks or allocate
new tasks, based on

i) packets received from the network,
ii) memory mapped operations received from a bus attachment module,
iii) descriptors inserted in work queue fifos by a user application, and
iv) tasks received from the context block.

In another aspect of the present invention, there is provided a method for
local and remote asynchronous completion control, for use in a System Area
Network. The System Area Network comprises a plurality of host channel
adapters, a plurality of target channel adapters and a switching fabric, and each
respective one of the adapters is constituted by a computer network controller of
the type as defined above in the most general embodiment stated, together with a
bus attachment module and a network interface. In the method of the invention,
message cyclic redundancy check as well as an address to a remote completion
queue, e.g. at a target, are attached, by such a micro sequencer, to a last packet
in a message to be sent from a sender, e.g. a host, to a receiver, e.g. a target.
Thereby, on reception of the last packet at the receiver and checking for data
integrity for the whole message transfer by a target micro sequencer, "receipt
complete" can be signaled directly from the target micro sequencer in the remote
process completion queue, and simultaneously a response is made back to the
sender. The sender will then signal "send complete" and status directly to a local
process.

As appears from the above, the present invention provides apparatus and a
method for implementation of a "network protocol engine", i.e. a network
controller, for use particularly, but not only, in SAN networks, in particular HCA
(host channel adapters) and TCA (target channel adapters). In a SAN it is then
referred to as an SPE, i.e. a SAN Protocol Engine.

The present invention is based on a programmable Multi-Context Micro
Sequencer (MCMS), running dedicated instructions optimized for network
protocols. A dynamic resource allocation scheme reflects the runtime situations by
keeping the most recently used tasks in a Fully Associative Context Block (FACB).
In connection with the micro sequencer is a configurable memory array used for
inbound address mapping and access protection, and keeping context information
of all the active channels not currently present in the FACB. Associated with the
MCMS is a Data Buffer with a number of read and write ports. This enables the
SPE to run different tasks in parallel. Attached to the MCMS is a Network
Protocol Engine (NPE), scheduling packets based on i) congestion information
provided by the layer flow control, ii) knowledge of the SAN topology (i.e. injection
rate control), iii) priorities of packets.

The network controller of the present invention is capable of running
multiple protected user-level RDMA with implicit completion control.

The SPE is independent of network packet length. Packet length is
programmable in order to improve bus bandwidth by doing input gathering, and
the SPE can therefore optimize the use of the attached buses.

Brief description of the drawings 

The above and further advantages may be more fully understood by
referring to the following exemplary description of embodiments, in conjunction
with the accompanying drawings of which:

Figure 1 presents an overview of a System Area Network.
Figure 2 presents a general purpose network packet.
Figure 3 presents a block diagram of a general HCA/TCA including an SPE.
Figure 4 presents a detailed block diagram of an SPE in accordance with a
preferred embodiment of the invention.

Figure 5 illustrates Local and Remote Completion Control in accordance
with a preferred embodiment of another aspect of the invention.

Detailed description 

The computer network controller of the present invention can be applied in
any computer network (a LAN, a SAN), but it is
particularly well suited for use in a System Area Network (SAN). The embodiments
described in the following will refer to a SAN application. Also in the following, the
term SAN Protocol Engine (SPE) will be used as a synonym for "computer
network controller".

A System Area Network (SAN) 1 is depicted in Figure 1. A SAN is a
network which interconnects a plurality of computers (hosts) 2 and a plurality of
IO-devices 8, and/or IO-subsystems. This enables Inter-Processor
Communication (IPC) (or clustering), host-to-peer (IO) communication, and
peer-to-peer communication, over the same network. The host SAN access point is
called a Host Channel Adapter (HCA) 6, while the peer SAN access point is called
the Target Channel Adapter (TCA) 3. Interconnection between HCAs and/or TCAs
is handled by high-performance point-to-point links 5 and switching fabrics 4.
Communication between HCAs and/or TCAs is either achieved by sending
messages, or by doing memory-mapped communication (e.g. DMA, Direct
Memory Access) and/or Programmed-IO (PIO) from a local node to a remote
node. Usually the following transfer models are supported:

• Acknowledged connection-oriented 

• Unacknowledged connection-oriented 

• Unacknowledged connection-less

All transfer methods are based on partitioning data into network packets by an
SPE. A data packet 7 is shown in fig. 2, and is constituted by a packet header 12,
a payload part 14 and a packet trailer 11. Each packet header 12 contains at least
i) a destination address (Destination ID) 13 describing the network address of the
packet's destination and to be used by the switching fabric 4 to route the packet 7
to the correct destination, ii) a source address 15 describing the network address
of the sender of the packet, iii) a command 17, describing the function the receiver
of the packet should perform, and iv) a sequence number 18. If the packet
contains data (payload), an address notification 16 is required, so the receiver will
know where to put the data.

Each packet trailer 11 is required to have an error-detecting code, usually a
cyclic redundancy check (CRC), to secure data integrity of the complete packet.
Packets are always received in the order they were sent, i.e. the switching
fabric 4 does not re-order packets during normal operation.
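
By way of illustration only, the packet layout described above can be sketched in software as follows. The field widths, the byte order and the use of a 32-bit CRC from `zlib` are assumptions made for the example; the description names the header fields and the trailer but does not fix their sizes.

```python
import struct
import zlib
from dataclasses import dataclass

# Hypothetical field widths: the description names the fields, not their sizes.
HEADER_FMT = ">HHBIQ"  # destination ID, source ID, command, sequence no., address
TRAILER_FMT = ">I"     # assumed 32-bit CRC covering header and payload


@dataclass
class Packet:
    dest_id: int
    source_id: int
    command: int
    seq: int
    address: int
    payload: bytes = b""

    def encode(self) -> bytes:
        """Serialize header and payload, then append the CRC trailer."""
        header = struct.pack(HEADER_FMT, self.dest_id, self.source_id,
                             self.command, self.seq, self.address)
        body = header + self.payload
        return body + struct.pack(TRAILER_FMT, zlib.crc32(body))


def decode(raw: bytes) -> Packet:
    """Split a raw packet and verify the trailer CRC over the complete packet."""
    hsize = struct.calcsize(HEADER_FMT)
    tsize = struct.calcsize(TRAILER_FMT)
    body, trailer = raw[:-tsize], raw[-tsize:]
    (crc,) = struct.unpack(TRAILER_FMT, trailer)
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    fields = struct.unpack(HEADER_FMT, body[:hsize])
    return Packet(*fields, payload=body[hsize:])
```

A receiver built this way rejects any packet whose trailer does not match the recomputed CRC, which is the data integrity role the trailer is given in the text.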

Fig. 3 shows a simplified block diagram of a general HCA/TCA 3, 6, and
indicates on respective sides of a SAN Protocol Engine (SPE) 20 a Bus
Attachment Module (BAM) 19 and a network interface 21, connected to another
network unit through a bi-directional point-to-point link 5.

A block diagram of an embodiment of the present invention is shown in
Figure 4. As mentioned with respect to fig. 3, the SPE interface to the host bus or
peer bus is referred to as the Bus Attachment Module (BAM) 19. The SPE
interface 21 to the network is referred to as the network layer.

The present invention uses an inbound scheduler 22 to decode, schedule
and invoke currently running tasks or allocate new tasks, based on i) packets
received from the network, ii) memory mapped operations received from the BAM
19, iii) descriptors inserted in work queue fifos 23 by the user application, and
iv) tasks received from a fully associative context block (FACB) 24. The inbound
scheduler 22 invokes a multi-context micro-sequencer (MCMS) 25 by a special set
of instructions.
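
The decision made by the inbound scheduler can be pictured with the following minimal sketch. It assumes, purely for illustration, that tasks are keyed by a channel number and that the only decision is whether to invoke a running task or allocate a new one; the class and method names are hypothetical and not taken from the description.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    NETWORK_PACKET = 1   # i) packets received from the network
    BUS_OPERATION = 2    # ii) memory mapped operations received from the BAM
    WORK_QUEUE = 3       # iii) descriptors inserted in the work queue fifos
    CONTEXT_BLOCK = 4    # iv) tasks received from the FACB


@dataclass
class InboundScheduler:
    running: dict = field(default_factory=dict)  # channel -> task record
    log: list = field(default_factory=list)      # decisions, for inspection

    def dispatch(self, source: Source, channel: int) -> str:
        """Invoke the task already running on this channel, else allocate one."""
        if channel in self.running:
            action = "invoke"
        else:
            self.running[channel] = {"source": source}
            action = "allocate"
        self.log.append((action, source.name, channel))
        return action
```

For example, a descriptor arriving on an idle channel yields "allocate", while a later network packet on the same channel yields "invoke".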

The present invention supports the concept defined in the Infiniband
Architecture scheduled to be released mid 2000. That means that messages and
DMA transfers can be managed directly between users without intervention from
the system kernel. In practice this means that user address space on one node is
mapped directly to user space on a remote node. Infiniband defines a set of
Channels with fixed mapping between local and remote memory.

An Address Translation Table (ATT) contained in a block 26 is set up once
by the kernel agents (device drivers) on both sides of the connection, when the
memory is registered. Unique contiguous address space is then exported to the
users, and is used as reference. This means that the HCA/TCA has the notion
of both local physical memory and the virtual remote memory through its inbound
and outbound mapping tables, and remote traffic is managed from a set of
chained descriptors set up directly by users. Block 26 is a configurable memory
array that is used for inbound address mapping and inbound/outbound access
protection. Additionally, block 26 keeps context information of all active channels
that are not currently present in the FACB 24. Memory array 26 is controlled and
updated by micro sequencer 25.

The ATT size is programmable, and depends on the number of Queue
Pairs (QP) supported, and number of bits per Protection Tag (PTag). E.g. an ATT
with 1M entries and 16-bit PTag may have 64k Channels. The ATT is accessed
for new tasks or when page crossing occurs during RDMA.
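
As a sketch only, the role of the ATT during an inbound RDMA can be modeled as a table lookup with a protection tag check, repeated whenever a transfer crosses a page boundary. The page size, the table organization and all names below are assumptions for the example and are not taken from the description.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size; the description does not state one


@dataclass(frozen=True)
class AttEntry:
    physical_page: int  # local physical page number
    ptag: int           # protection tag registered by the kernel agent


class AddressTranslationTable:
    def __init__(self):
        self._entries = {}  # exported page number -> AttEntry

    def register(self, exported_page, physical_page, ptag):
        """Done once by the kernel agent when memory is registered."""
        self._entries[exported_page] = AttEntry(physical_page, ptag)

    def translate(self, address, ptag):
        """Map an exported address to a physical one, checking the tag."""
        page, offset = divmod(address, PAGE_SIZE)
        entry = self._entries.get(page)
        if entry is None or entry.ptag != ptag:
            raise PermissionError("unmapped page or protection tag mismatch")
        return entry.physical_page * PAGE_SIZE + offset


def rdma_segments(att, address, length, ptag):
    """Yield (physical address, byte count) pieces, re-mapping on page crossing."""
    while length > 0:
        chunk = min(length, PAGE_SIZE - address % PAGE_SIZE)
        yield att.translate(address, ptag), chunk
        address += chunk
        length -= chunk
```

In this model the table is consulted once per page touched, which mirrors the statement that the ATT is accessed for new tasks or when page crossing occurs.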

The work queue fifos 23 contain addresses and protection tags of
descriptors inserted directly by the user or kernel agent. The present invention is,
however, not limited to the use of these fifos. They are merely used as an
illustration of how communication between the SPE and user application may be
performed.

In the preferred embodiment of the present invention, a FACB 24 is used to
hold e.g. the 16 most recently used contexts. The two upper contexts are reserved
for locally driven RDMA, while the other 14 are then dynamically allocated
between inbound RDMA, inbound RMA and outbound RMA. The context block 24
contains source addresses (SourceID) and destination addresses (destinationID)
and mapping for RDMA operations, dedicated flags like page crossing in order to
do new mapping, word count zero detection, data buffer management and
integrity check, events like sequence error, protection tag check. The FACB
synchronizes all these events from the Inbound Scheduler 22, the Multi-context
Micro Sequencer 25 and a Network Protocol Engine 27 (that executes the
function of a link-dependent packet sender and outbound scheduler), so that
threads are invoked, restarted, switched or terminated immediately.
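
The allocation behaviour of such a context block can be illustrated with the toy model below. It assumes that a context evicted from the 14 dynamic entries is the least recently used one and that it is spilled to the memory array (block 26) until needed again; the eviction policy and all names are assumptions for the example.

```python
from collections import OrderedDict


class ContextBlock:
    """Toy model of a fully associative context block with spill to memory."""

    def __init__(self, size=16, reserved=2):
        self.reserved = {}             # entries for locally driven RDMA
        self.reserved_slots = reserved
        self.dynamic = OrderedDict()   # channel -> context, oldest first
        self.dynamic_slots = size - reserved
        self.memory_array = {}         # contexts not currently resident

    def acquire(self, channel, local_rdma=False):
        """Return the context for a channel, loading or evicting as needed."""
        if local_rdma:
            if channel not in self.reserved:
                if len(self.reserved) >= self.reserved_slots:
                    raise RuntimeError("no reserved context free")
                self.reserved[channel] = self.memory_array.pop(channel, {})
            return self.reserved[channel]
        if channel in self.dynamic:
            self.dynamic.move_to_end(channel)       # mark as most recent
        else:
            if len(self.dynamic) >= self.dynamic_slots:
                victim, ctx = self.dynamic.popitem(last=False)
                self.memory_array[victim] = ctx     # spill least recent
            self.dynamic[channel] = self.memory_array.pop(channel, {})
        return self.dynamic[channel]
```

With the default sizes, a fifteenth active channel pushes the least recently touched context out to the memory array, and that context is restored unchanged when its channel becomes active again.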

The Multi-context Micro Sequencer 25 is optimized for running network
related instructions. The MCMS itself is packet and network independent.
The SPE 20 can, in the embodiment under discussion, process up to 8
separate data paths simultaneously (4 data paths default). The MCMS handles
the control flow, while a data buffer 28 handles the payload. Both units execute
independently. The data buffer 28 contains up to 4 write ports and 4 read ports,
for high-efficient data movement. The number of entries is equal to the number of
FACB entries. The width is programmable. RDMA has dedicated output buffers for
efficient pipelining.

The MCMS 25 detects and flags immediately (1 cycle) special events like
page boundary crossing, word-count-zero and different error conditions. New
tasks are invoked with minimum delay, while task switching is performed in 2
cycles.

The MCMS can be programmed to gather packets received from the
network (input gathering). The present invention can therefore optimize the
use of the attached bus 19.
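
Input gathering can be pictured, as a sketch under assumptions, as accumulating network payloads until a programmable gathering size is reached and only then issuing one transfer on the attached bus. The class below is a hypothetical software model; the gathering size in the usage note is an arbitrary example value.

```python
class InputGatherer:
    """Accumulate network payloads and release them in bus-sized bursts."""

    def __init__(self, gather_size):
        self.gather_size = gather_size  # bytes per bus transfer (programmable)
        self._pending = bytearray()

    def push(self, payload):
        """Add one packet payload; return the list of full bursts now ready."""
        self._pending += payload
        bursts = []
        while len(self._pending) >= self.gather_size:
            bursts.append(bytes(self._pending[:self.gather_size]))
            del self._pending[:self.gather_size]
        return bursts

    def flush(self):
        """Release whatever is left, e.g. at the end of a message."""
        rest, self._pending = bytes(self._pending), bytearray()
        return [rest] if rest else []
```

With a gathering size of 256 bytes and 64-byte network payloads, four packets produce a single bus transfer, so the packet size seen by the bus is decoupled from the packet size on the network.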

The MCMS 25 performs on-the-fly data integrity check. Messages can be
checked either on each page boundary or at the end of the message. Individual
packets are checked by the link layer level. In case of an acknowledged
connection-oriented transfer model, a negative acknowledge packet is returned to
the sender if the data was checked to be incorrect. If a sender (i.e. network
controller) does not receive an acknowledge packet within a fixed time period
(watchdog timeout), the transfer is marked unsuccessful and the SPE will have to
re-transmit the packet(s).
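
The sender side of the acknowledged transfer model can be sketched as a simple retry loop. The `transmit` callable, the `Reply` values and the retry limit are hypothetical and introduced only for the example; the description does not state how many re-transmissions are attempted.

```python
from enum import Enum


class Reply(Enum):
    ACK = "ack"
    NACK = "nack"        # receiver checked the data to be incorrect
    TIMEOUT = "timeout"  # no acknowledge within the watchdog period


def send_reliably(packet, transmit, max_attempts=3):
    """Send until acknowledged; return the number of attempts used.

    `transmit` is a hypothetical callable returning a Reply per attempt.
    The retry limit is an assumption made for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        if transmit(packet) is Reply.ACK:
            return attempt
        # Negative acknowledge or watchdog timeout: fall through and re-transmit.
    raise RuntimeError("transfer marked unsuccessful")
```

Both failure cases take the same path in this model: a negative acknowledge and a watchdog timeout each lead to another transmission of the same packet.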

In case the MCMS receives a negative acknowledgement it will re-transmit
the packet.

The MCMS provides integrated local and remote completion. The last
packet sent in a message contains both the accumulated message CRC and
completion control. The SPE on the receiving HCA/TCA side can therefore signal
"receive complete" directly in the remote process's Completion Queue (CQ), and
simultaneously respond to the initiator (sender), by sending an acknowledge
packet. Upon receiving an acknowledge response, the initiator then signals "send
complete" to the local process. No explicit synchronization is needed. The user on
the remote side can decide whether to poll the transaction status locally, or being
invoked by interrupt. The completion control can be described in the following
scenario, while referring to fig. 5.

a) The FACB 24 on local side detects that w

immediately to the Host MCMS. The MCMS then extracts the accumulated
message CRC and the Remote Completion Queue address from the RDMA
context, dispatches a "last" packet to the Transmitter, and switches context.

b) When the remote side detects such a packet, the remote FACB checks the
accumulated CRC and invokes the associated context. The remote MCMS checks 
the flag, writes status to the CQ 29 and switches context. 

c) When the "write response" returns from the BAM, the context is invoked
again and the MCMS sends a response back to the host node, and terminates the
context.

d) When this response arrives at the host node, "send complete" and status
are written to the channel's completion queue, and the context is terminated. 
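
The scenario in steps a) to d) can be simulated with the short sketch below. It models the last packet as carrying the accumulated message CRC and the remote completion queue address, and it treats the returned acknowledge as a boolean. The dictionary-based packet, the `Receiver` class and the use of `zlib.crc32` as the accumulated CRC are assumptions for the example.

```python
import zlib


def send_message(chunks, remote_cq_address):
    """Yield packets; the last one carries the message CRC and the CQ address."""
    crc = 0
    for i, chunk in enumerate(chunks):
        crc = zlib.crc32(chunk, crc)  # accumulate over the whole message
        packet = {"seq": i, "payload": chunk, "last": i == len(chunks) - 1}
        if packet["last"]:
            packet["message_crc"] = crc
            packet["remote_cq"] = remote_cq_address
        yield packet


class Receiver:
    def __init__(self):
        self.crc = 0
        self.completion_queues = {}  # CQ address -> list of statuses

    def receive(self, packet):
        """Return an acknowledge (True/False) after the last packet, else None."""
        self.crc = zlib.crc32(packet["payload"], self.crc)
        if not packet["last"]:
            return None
        ok = self.crc == packet["message_crc"]
        status = "receive complete" if ok else "error"
        self.completion_queues.setdefault(packet["remote_cq"], []).append(status)
        self.crc = 0
        return ok


def transfer(chunks, receiver, remote_cq_address, local_cq):
    """Run one message transfer and record the sender-side completion."""
    ack = False
    for packet in send_message(chunks, remote_cq_address):
        ack = receiver.receive(packet)
    local_cq.append("send complete" if ack else "error")
```

In this model neither side issues a separate synchronization message: the remote completion entry is written as a side effect of receiving the last packet, and the local one as a side effect of receiving the response.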

This scheme will reduce almost all protocol overhead, and sustained user
throughput will increase dramatically.

As previously mentioned, the present invention uses a Network Protocol 
Engine 27 to schedule packets to be sent onto the network. The NPE scheduler is 
capable of link injection control, based on feedback from the link layer 21.
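
One possible reading of such a scheduler, given as a hedged sketch rather than as the actual NPE logic, is a priority queue in which packets bound for a virtual channel currently reported as congested are held back. The class, its method names and the convention that a lower number means higher priority are all assumptions for the example.

```python
import heapq
import itertools


class OutboundScheduler:
    """Pick the highest-priority packet whose virtual channel is not congested."""

    def __init__(self):
        self._queue = []                 # heap of (priority, order, channel, packet)
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self.congested = set()           # channels flagged by link layer feedback

    def submit(self, packet, channel, priority):
        """Queue a packet; a lower priority number is served first."""
        heapq.heappush(self._queue, (priority, next(self._order), channel, packet))

    def link_feedback(self, channel, is_congested):
        """Record congestion information reported for one virtual channel."""
        if is_congested:
            self.congested.add(channel)
        else:
            self.congested.discard(channel)

    def next_packet(self):
        """Return the next packet to inject, or None if all are held back."""
        held, chosen = [], None
        while self._queue:
            item = heapq.heappop(self._queue)
            if item[2] in self.congested:
                held.append(item)        # hold back, keep its place in line
            else:
                chosen = item[3]
                break
        for item in held:
            heapq.heappush(self._queue, item)
        return chosen
```

Holding back rather than dropping congested traffic means a control packet on a congested channel is sent as soon as the link layer reports that the channel has cleared.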

The SPE may transmit up to four packets from different tasks
simultaneously, a request and a response to the network, and the same to the
attached bus, processing 32 bytes per cycle (64 bytes with 128-bit wide data
paths).

In the above description, reference has been made to an embodiment of
the invention particularly as depicted in the appended drawings. However, it will
be appreciated that various modifications and alterations might be made by
persons skilled in the art without departing from the spirit and scope of the present
invention. The scope of the invention should therefore only be restricted by the
claims that follow, or equivalents thereof.



